Linear Regression

By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie with example algorithms by David Edwards

Part of the Quantopian Lecture Series:

Notebook released under the Creative Commons Attribution 4.0 License.


Linear regression is a technique that measures the relationship between two variables. If we have an independent variable $X$, and a dependent outcome variable $Y$, linear regression allows us to determine which linear model $Y = \alpha + \beta X$ best explains the data. As an example, let's consider TSLA and SPY. We would like to know how TSLA varies as a function of how SPY varies, so we will take the daily returns of each and regress them against each other.

Python's statsmodels library has a built-in linear fit function. Note that this will give a line of best fit; whether or not the relationship it shows is significant is for you to determine. The output will also have some statistics about the model, such as R-squared and the F value, which may help you quantify how good the fit actually is.


In [1]:
# Import libraries
import numpy as np
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt
import math

First we'll define a function that performs linear regression and plots the results.


In [2]:
def linreg(X,Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = regression.linear_model.OLS(Y, X).fit()
    a = model.params[0]
    b = model.params[1]
    X = X[:, 1]  # Remove the constant column so we can plot against the original X values

    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * b + a
    plt.scatter(X, Y, alpha=0.3) # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=0.9);  # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model.summary()

Now we'll get pricing data on TSLA and SPY and perform a regression.


In [3]:
start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)

# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]

linreg(r_b.values, r_a.values)


Out[3]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.202
Model:                            OLS   Adj. R-squared:                  0.199
Method:                 Least Squares   F-statistic:                     63.14
Date:                Tue, 04 Aug 2015   Prob (F-statistic):           6.66e-14
Time:                        18:16:38   Log-Likelihood:                 548.81
No. Observations:                 251   AIC:                            -1094.
Df Residuals:                     249   BIC:                            -1087.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.0011      0.002      0.626      0.532        -0.002     0.004
x1             1.9271      0.243      7.946      0.000         1.449     2.405
==============================================================================
Omnibus:                       63.947   Durbin-Watson:                   2.006
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              371.184
Skew:                           0.850   Prob(JB):                     2.50e-81
Kurtosis:                       8.710   Cond. No.                         141.
==============================================================================

Each point on the above graph represents a day, with the x-coordinate being the return of SPY and the y-coordinate being the return of TSLA. As we can see, the line of best fit tells us that for every 1% of additional return we see from SPY, we should expect an extra 1.92% from TSLA. This is expressed by the parameter $\beta$, estimated here as 1.9271. Of course, for a decreased return we will also see about double the loss in TSLA, so we haven't gained anything; we are just more volatile.

Linear Regression vs. Correlation

  • Linear regression gives us a specific linear model, but is limited to cases of linear dependence.
  • Correlation is general to linear and non-linear dependencies, but doesn't give us an actual model.
  • Both are measures of covariance.
  • Linear regression can give us the relationship between Y and many independent variables by making X multidimensional, as in the sketch after this list.
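
For instance, here is a minimal sketch of a multiple regression with two independent variables, using synthetic data and made-up coefficients (it reuses the numpy and statsmodels imports from above):

# Sketch: multiple regression by stacking two predictors into a 2-D X (synthetic data)
X1 = np.random.rand(100)
X2 = np.random.rand(100)
Y_multi = 2 * X1 - 3 * X2 + 0.1 * np.random.randn(100)

# Stack the predictors column-wise so X has shape (100, 2), then add the intercept column
X_multi = sm.add_constant(np.column_stack([X1, X2]))
results = sm.OLS(Y_multi, X_multi).fit()
print(results.params)  # [intercept, coefficient on X1, coefficient on X2]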

Knowing Parameters vs. Estimates

It is very important to keep in mind that all $\alpha$ and $\beta$ parameters estimated by linear regression are just that - estimates. You can never know the underlying true parameters unless you know the physical process producing the data. The parameters you estimate today may not match those produced by the same analysis run on data that includes tomorrow, and the underlying true parameters may themselves be moving. As such, it is very important when doing actual analysis to pay attention to the standard error of the parameter estimates. More material on the standard error will be presented in a later lecture. One way to get a sense of how stable your parameter estimates are is to estimate them using a rolling window of data and see how much variance there is in the estimates, as sketched below.
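
As a rough sketch of that idea (reusing the r_a and r_b return series computed above; the 60-day window length is an arbitrary choice), a rolling beta estimate can be computed with pandas:

# Sketch: estimate beta over a rolling 60-day window to see how stable it is
# (reuses the r_a and r_b return series from the TSLA/SPY regression above)
window = 60  # arbitrary window length
rolling_beta = r_a.rolling(window).cov(r_b) / r_b.rolling(window).var()

plt.plot(rolling_beta.index, rolling_beta.values)
plt.xlabel('Date')
plt.ylabel('Rolling %d-day beta estimate' % window);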

Ordinary Least Squares

Regression works by optimizing the placement of the line of best fit (or plane in higher dimensions). It does so by defining how bad the fit is using an objective function. In ordinary least squares regression, a very common type and what we use here, the objective function is:

$$\sum_{i=1}^n (Y_i - a - bX_i)^2$$

That is, for each observation we compare the value predicted by the line of best fit with the actual value and take the square of the difference. This function decreases as the parameter estimates improve. Simple linear regression is a case of numerical optimization with a closed-form solution, so no iterative optimizer is needed.
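
For a single independent variable the minimizer has the familiar closed form $b = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$ and $a = \bar{Y} - b\bar{X}$. Here is a minimal sketch (on synthetic data with made-up coefficients) checking the closed form against statsmodels:

# Sketch: closed-form OLS estimates for one regressor, checked against statsmodels
X_demo = np.random.rand(100)
Y_demo = 1 + 3 * X_demo + 0.1 * np.random.randn(100)

b = np.cov(X_demo, Y_demo, ddof=1)[0, 1] / np.var(X_demo, ddof=1)  # slope = Cov(X, Y) / Var(X)
a = Y_demo.mean() - b * X_demo.mean()                              # intercept = mean(Y) - b * mean(X)

model = sm.OLS(Y_demo, sm.add_constant(X_demo)).fit()
print(a, b)          # closed-form estimates
print(model.params)  # statsmodels estimates; should match up to floating-point error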

Example case

Now let's see what happens if we regress two purely random variables.


In [4]:
X = np.random.rand(100)
Y = np.random.rand(100)
linreg(X, Y)


Out[4]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                  0.006141
Date:                Tue, 04 Aug 2015   Prob (F-statistic):              0.938
Time:                        18:16:38   Log-Likelihood:                -4.1693
No. Observations:                 100   AIC:                             12.34
Df Residuals:                      98   BIC:                             17.55
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.4267      0.052      8.270      0.000         0.324     0.529
x1             0.0069      0.089      0.078      0.938        -0.169     0.183
==============================================================================
Omnibus:                        5.412   Durbin-Watson:                   2.113
Prob(Omnibus):                  0.067   Jarque-Bera (JB):                4.367
Skew:                           0.404   Prob(JB):                        0.113
Kurtosis:                       2.372   Cond. No.                         4.43
==============================================================================

The above shows a fairly uniform cloud of points. It is important to note that even with 100 samples, the line has a visible slope due to random chance. This is why it is crucial that you use statistical tests and not visualizations to verify your results.
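
As a quick sketch of such a test (re-fitting the same random X and Y directly rather than going through linreg), we can look at the p-value on the slope, which the summary above reports as 0.938:

# Sketch: test the slope's significance directly instead of eyeballing the plot
results = sm.OLS(Y, sm.add_constant(X)).fit()
print(results.pvalues[1])  # p-value on the slope; it is far above 0.05, so we cannot
                           # reject the hypothesis that the true slope is zero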

Now let's make Y dependent on X plus some random noise.


In [5]:
# Generate ys correlated with xs by adding normally-distributed errors
Y = X + 0.2*np.random.randn(100)

linreg(X,Y)


Out[5]:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.713
Model:                            OLS   Adj. R-squared:                  0.710
Method:                 Least Squares   F-statistic:                     243.9
Date:                Tue, 04 Aug 2015   Prob (F-statistic):           2.41e-28
Time:                        18:16:39   Log-Likelihood:                 20.885
No. Observations:                 100   AIC:                            -37.77
Df Residuals:                      98   BIC:                            -32.56
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0600      0.040     -1.494      0.138        -0.140     0.020
x1             1.0773      0.069     15.619      0.000         0.940     1.214
==============================================================================
Omnibus:                        3.009   Durbin-Watson:                   2.215
Prob(Omnibus):                  0.222   Jarque-Bera (JB):                2.475
Skew:                           0.373   Prob(JB):                        0.290
Kurtosis:                       3.197   Cond. No.                         4.43
==============================================================================

In a situation like the above, the line of best fit does indeed model the dependent variable Y quite well (with a high $R^2$ value).

Evaluating and reporting results

The regression model relies on several assumptions:

  • The independent variable is not random.
  • The variance of the error term is constant across observations. This is important for evaluating the goodness of the fit.
  • The errors are not autocorrelated. The Durbin-Watson statistic reported in the regression output detects this; a value close to 2 suggests that the errors are not autocorrelated.
  • The errors are normally distributed. If this does not hold, we cannot use some of the statistics, such as the F-test.

If we confirm that the necessary assumptions of the regression model are satisfied, we can safely use the statistics reported to analyze the fit. For example, the $R^2$ value tells us the fraction of the total variation of $Y$ that is explained by the model.
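
As a sketch of how these diagnostics can be pulled out of a fitted model directly (rather than read off the summary table), statsmodels exposes the residuals along with durbin_watson and jarque_bera helpers in statsmodels.stats.stattools; here we re-fit the TSLA/SPY regression from above:

# Sketch: extract diagnostic statistics from a fitted OLS model's residuals
from statsmodels.stats.stattools import durbin_watson, jarque_bera

model = sm.OLS(r_a.values, sm.add_constant(r_b.values)).fit()

print(model.rsquared)              # fraction of the variation of Y explained by the model
print(durbin_watson(model.resid))  # values near 2 suggest little autocorrelation
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print(jb_pvalue)                   # a small p-value suggests the errors are not normal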

When making a prediction based on the model, it's useful to report not only a single value but a confidence interval. The linear regression reports 95% confidence intervals for the regression parameters, and we can visualize what this means using the seaborn library, which plots the regression line and highlights the 95% (by default) confidence interval for the regression line:


In [6]:
import seaborn

start = '2014-01-01'
end = '2015-01-01'
asset = get_pricing('TSLA', fields='price', start_date=start, end_date=end)
benchmark = get_pricing('SPY', fields='price', start_date=start, end_date=end)

# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:]
r_b = benchmark.pct_change()[1:]

seaborn.regplot(x=r_b.values, y=r_a.values);


We can also find the standard error of estimate, which measures the standard deviation of the error term $\epsilon$, by getting the scale parameter of the model returned by the regression and taking its square root. The formula for standard error of estimate is $$ s = \left( \frac{\sum_{i=1}^n \epsilon_i^2}{n-2} \right)^{1/2} $$

This is useful because the variance of the error in our prediction $\hat{Y}$ given the value $X$ is $$ s_f^2 = s^2 \left( 1 + \frac{1}{n} + \frac{(X - \mu_X)^2}{(n-1)\sigma_X^2} \right) $$

where $\mu_X$ is the mean of our observations of $X$ and $\sigma_X$ is their sample standard deviation. Then the 95% confidence interval for the prediction is $\hat{Y} \pm t_c s_f$, where $t_c$ is the critical value of the $t$-distribution with $n - 2$ degrees of freedom at the desired 95% confidence level.
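
A rough sketch of these formulas in code, re-fitting the TSLA/SPY regression from above (the choice of the SPY return to predict at is arbitrary):

# Sketch: standard error of estimate and a 95% prediction interval at a chosen X
import scipy.stats as stats

model = sm.OLS(r_a.values, sm.add_constant(r_b.values)).fit()
n = len(r_b)

s = np.sqrt(model.scale)  # standard error of estimate: sqrt(sum of squared residuals / (n - 2))

X_pred = 0.01  # arbitrary SPY daily return at which to predict TSLA's return
Y_hat = model.params[0] + model.params[1] * X_pred

# Standard deviation of the prediction error at X_pred
s_f = s * np.sqrt(1 + 1.0/n + (X_pred - r_b.mean())**2 / ((n - 1) * r_b.var()))

t_c = stats.t.ppf(0.975, n - 2)  # critical t value for a 95% interval
print(Y_hat - t_c * s_f, Y_hat + t_c * s_f)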

Mathematical Background

This is a very brief overview of linear regression. For more, please see: https://en.wikipedia.org/wiki/Linear_regression